Fix callback not being called after activesock shutdown (#4864) #4878
nanangizz merged 3 commits into pjsip:master
Conversation
Force-pushed c60348f to e608796
This fixes the issue as described and reproduced in #4864. I am moderately confident that this is a necessary and sufficient fix; however, the whole architecture around the TLS transport is quite difficult to get my head around. There are many points where a
Pull request overview
Fixes a shutdown edge case in PJLIB's active socket write-completion path so higher layers still receive the send-completion callback, preventing reference leaks (e.g., callers that decrement `pjsip_tx_data` refs in `on_data_sent`).
Changes:
- Invoke `on_data_sent` when `SHUT_TX` is set in `ioqueue_on_write_complete()` (instead of returning silently).
Force-pushed 226b170 to af16a17
Actually, with further testing I encountered another case where the callback still isn't called. This requires more thorough troubleshooting. One major pain point is that the issue is only reproducible in high-load environments or with the ioqueue "fast track" commented out. Marking this as draft until I figure out a more thorough solution.
The leak happens when:
1. `send_buf_pending` is already in use;
2. `ssock_on_data_sent` calls `flush_circ_buf_output`;
3. `send_buf_pending` gets overwritten → the previous data in `send_buf_pending` is lost forever, its callback is never called, and therefore the reference to the tdata it contains is never decremented, preventing transport destruction.
Fixed a second leak. At this point I've lost all confidence that the original implementation has any understanding of the importance of not losing track of ioqueue keys. I'll do a thorough review next week.
The fast track remains enabled by default, but this is useful for troubleshooting. The ioqueue has two wildly different behaviors:
1. In the fast track (almost always used in low-load scenarios), sending a `write_op` always immediately returns `PJ_SUCCESS`. The semantics in this case are that the _caller_ is responsible for calling any relevant callbacks.
2. Outside the fast track (relevant in high-load scenarios, which unfortunately do not seem to be extensively covered by regression tests), sending a `write_op` returns `PJ_EPENDING` in the happy path, which changes the semantics: now the _ioqueue_ becomes responsible for the data owned by the `write_op` and for calling the relevant callbacks.
See pjsip#4864 and pjsip#4878 for why this is a tricky behavior that can easily be missed and cause bugs. Disabling the fast track allows for more easily testing the second case.
Force-pushed 4a54cd5 to b3440a7
Before the bugfixes in pjsip#4878, the test fails: `16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)`
Force-pushed b3440a7 to 47cead4
Force-pushed 47cead4 to 31f22f4
Force-pushed 31f22f4 to bc22897
Force-pushed d395b2b to bc60815
@azertyfun We've tried to address the issues mentioned earlier. Please let us know if you notice any further issues or have additional comments.
We deployed the earlier fix (which sends garbage over the wire, though that doesn't really matter in our use case). However, we had to roll it back because we unexpectedly encountered crashes. I am currently investigating the cause. I still think the fix is sound in principle, but it may be triggering other existing race conditions. In particular, this bit seems faulty: if the ioqueue is busy,
You're right. I asked the AI to check that area, and it found several issues:
1. Wrong `app_key` passed to `ssl_send`.
2. `PJ_EPENDING` not handled (double-send).
3. Missing callback on `PJ_SUCCESS` (tdata leak).
4. `send_buf_pending` not cancelled on error in `ssock_on_data_sent`.
Here is the draft patch (compile-tested only; I will test further next week).
After further investigation, the crash I described above was one of the outcomes, but the more common one was that we saw
Now it is clearly related to the
Thanks to the error message we know that
From there I can only speculate why there was a deadlock, but
The whole logic for SSL handshaking is spread out, has multiple redundancies, and is probably just as buggy as the rest of the code touched by this PR. Another thorough review is needed, but that's going to be quite a few additional man-hours, which I'm starting to run short on as other priorities are catching up after several weeks of working on this already. And anyway, how can we know when we've squashed the last critical bug? How can we verify our fix when the interfaces are so complex and behave in radically different ways under load vs. not, including in edge cases like SSL renegotiation under load?
Unfortunately, given that I caused a major incident by pushing a partial fix, we won't get approval for another attempt unless we have a complete explanation for all issues encountered and a way to reliably reproduce our race conditions and verify the correctness of any fixes. Perhaps the solution for us will be to wait until
…ip#4878)

When pj_ioqueue_unregister() is called with pending async writes, on_write_complete callbacks were silently skipped. Upper layers (e.g., SIP transport) rely on these callbacks to release resources such as tdata references, causing reference leaks under high load.

The fix drains pending write callbacks during key unregistration:
- write_cb_list (completed ops, deferred callback): invoked with op->written (success status), respecting write_callback_thread serialization.
- write_list (pending ops, never sent): invoked with -PJ_ECANCELLED.

Changes across all maintained ioqueue backends (epoll, kqueue, select, IOCP). Also includes:
- PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing
- activesock: forward on_data_sent callback when SHUT_TX is set
- Regression test for send callbacks on activesock close

Co-Authored-By: Nathan Monfils <nathanmonfils@gmail.com>
Co-Authored-By: Claude Code
Force-pushed bc60815 to aacb3bc
@azertyfun Thanks for the detailed crash report and analysis. You're right: the
We've removed the SSL change from this PR. The ioqueue-level fix (draining
The SSL
Could you try deploying this updated version (without the SSL change)? It should fix the tdata leak without triggering the renegotiation crash.
We already had a major customer-impacting incident, which unfortunately means that at this point I won't be able to get approval to push further fixes into production without very strong guarantees. We are now mitigating the TLS issues by staying on version 2.15.1 but reducing the number of registrations per TLS connection. We will need proper assurances (a refactor enabling thorough unit-test coverage of the SSL subsystem, and probably additional integration testing with Asterisk) before I will be allowed to push a fix through.
(To be clear: for the purposes of this PR I have no problem if it gets merged as-is; I just won't be able to verify that it fixes our specific production issues.)
- Add PJ_UNUSED_ARG for status/sent variables in pj_ioqueue_send(), pj_ioqueue_sendto(), and pj_ioqueue_accept() when fast-track is disabled, fixing MSVC C4101 warnings.
- Fix activesock UDP echo test: accept PJ_EPENDING as non-error from sendto(), since the async path always returns PJ_EPENDING when fast-track is disabled. Also fix the error message to print the actual send status instead of the recvfrom callback status.

Co-Authored-By: Claude Code
Replace the ring-buffer-based send mechanism (send_buf + single send_buf_pending slot) with per-op pool-allocated ssl_send_op_t using an embedded encrypted-data buffer, a free list with a configurable cap, and true memory release on discard. This eliminates the PJ_ENOMEM crash during renegotiation reported in #4878.

Key changes:
- New ssl_send_op_t with per-op pool for true memory release
- ssl_do_handshake_and_flush() wrapper unifying error handling across all 5 SSL backends (OpenSSL, GnuTLS, mbedTLS, Schannel, Darwin)
- Fix flush_delayed_send: pass correct app_key, handle PJ_EPENDING to prevent double-send, invoke callback on synchronous success
- Fix ssock_on_data_sent: check app_key (not send_key) for handshake/shutdown detection (was dead code before)
- Fix ssock_on_connect_complete: return via on_handshake_complete instead of bare PJ_ENOMEM from a pj_bool_t function
- Rename circ_buf_output/input to ssl_write_buf/ssl_read_buf for clarity (not always circular; OpenSSL uses a memory BIO)
- Add "Caller must hold write_mutex" documentation to ssl_write() in all 6 backends

Test improvements:
- send_load_test: 200 rapid sends with async callback verification
- large_msg_test: 64KB message (multi-TLS-record) echo verification
- close_pending_test: close socket with pending sends (no crash)
- bidir_test: bidirectional simultaneous send load
- mt_send_load_test: 3 worker threads + 3 concurrent clients

CI: add ioq-no-fast-track job (PJ_IOQUEUE_FAST_TRACK=0 + ASan) to force the async send path in activesock and ssl_sock tests.

Co-Authored-By: Claude Code
@azertyfun Thanks again for the detailed report and analysis on this issue. Sorry for the problems it caused in your production environment.
Building on the ioqueue fix merged here, we've done a broader SSL socket send-path refactor in #4909: replacing the ring-buffer mechanism with per-op pool allocation, fixing several related bugs found during the process, and adding stress tests that cover the high-load/concurrent scenarios you described. Unfortunately we don't currently have an Asterisk test setup, so we haven't been able to validate against that environment.
Would appreciate any feedback or suggestions, especially if there are specific patterns from your environment we should add to the test coverage.
Add a renegotiation option to send_load_test and bidir_test:
- renego_at parameter triggers pj_ssl_sock_renegotiate() mid-stream
- Supports both client-initiated and server-initiated renegotiation
- Handle PJ_EBUSY from sends during renegotiation
- Guard with #if for OpenSSL/GnuTLS only (TLS 1.2 required)

Known failure: sends after renegotiation return PJ_EPENDING/PJ_EBUSY but callbacks never fire; flush_delayed_send doesn't deliver them. This is the pre-existing bug reported in #4878.

Add concurrent_send_test (multi-threaded send on the same socket):
- 3 threads blast-send on the same SSL socket simultaneously
- Simulates production SIP where worker threads share a TLS transport
- Detects out-of-order TLS records (connection error) and data loss

Add PJ_RACE_ME(5) in ssl_send between ssl_write and flush to widen the race window for concurrent send testing.

CI: enable PJ_RACE_ME in ioq-no-fast-track matrix jobs to actively probe for race conditions.

Co-Authored-By: Claude Code
Enhance close_pending_test to verify that on_data_sent callbacks fire for all pending sends when the socket is destroyed. Previously the test only checked for crashes, not callback delivery. This catches the #4878 scenario where app resources leak due to lost callbacks.

Add destroy_during_renego_test: triggers renegotiation mid-blast so that sends are delayed in write_pending, then immediately closes the socket. Verifies that error callbacks (-PJ_ECANCELLED) fire for all delayed sends during ssl_on_destroy.

Both tests track err_cb_cnt separately to confirm that the destroy handler (not normal completion) fires the callbacks.

Co-Authored-By: Claude Code
Add stress tests that exercise the SSL send queue, renegotiation, and destroy paths:
- send_load_test: blast sends and verify all complete with callbacks
- close_pending_test: close socket with pending sends, verify on_data_sent callbacks fire (catches #4878 resource leak)
- bidir_test: both sides send simultaneously, verify no data loss
- mt_send_load_test: multi-threaded send on separate sockets
- concurrent_send_test: 3 threads blast-send on the same SSL socket, detects TLS record corruption from data-mixing races
- Renegotiation variants of send_load and bidir tests: client and server initiated, verify delayed sends flush after completion
- destroy_during_renego_test: close socket while renegotiation is in progress, verify error callbacks fire for delayed sends

CI: enable PJ_RACE_ME in ioq-no-fast-track matrix jobs to widen race windows for concurrent send testing.

Co-Authored-By: Claude Code
Add stress tests that exercise the SSL send queue, renegotiation, and destroy paths:
- send_load_test: blast sends and verify all complete with callbacks
- close_pending_test: close socket with pending sends, verify on_data_sent callbacks fire (catches #4878 resource leak)
- bidir_test: both sides send simultaneously, verify no data loss
- mt_send_load_test: multi-threaded send on separate sockets
- concurrent_send_test: 3 threads blast-send on the same SSL socket, detects TLS record corruption from data-mixing races
- Client renegotiation variants (OpenSSL only): send_load, bidir, destroy-during-renegotiation
- Server renegotiation variant (OpenSSL + GnuTLS): send_load

Add RUN_SUBTEST macro to log the failing sub-test name in output.

CI: enable PJ_RACE_ME in ioq-no-fast-track matrix jobs to widen race windows for concurrent send testing.

Co-Authored-By: Claude Code

This is critical to avoid a reference leak. The callback is responsible for calling `pjsip_tx_data_dec_ref()`.
Background: When `pj_ioqueue_unregister()` is called with pending async writes, `on_write_complete` callbacks are silently skipped. This causes tdata reference leaks, since upper layers rely on these callbacks to call `pjsip_tx_data_dec_ref()`. The bug only manifests when the ioqueue fast track fails (high load, full send buffer), making it nearly invisible in normal testing. It exists with or without `PJ_IOQUEUE_CALLBACK_NO_LOCK`.

Approach: Drain pending write callbacks during key unregistration, handling two lists with different semantics: `write_cb_list` (completed ops, deferred callback) is invoked with success status, respecting `write_callback_thread` serialization; then `write_list` (pending, unsent ops) is invoked with `-PJ_ECANCELLED`.

Maintainer rework addressing review feedback:
- `grp_lock` ref counting.
- `-PJ_ECANCELLED`; `write_callback_thread` properly cleared in the closing path.
- `on_data_sent` when `SHUT_TX` is set (not `on_data_read`).
- `PJ_IOQUEUE_FAST_TRACK` config to disable the fast track for testing the async path.
- `SO_SNDBUF` shrink to force the async path.

Note: The SSL socket `send_buf_pending` fixes (`flush_delayed_send`, `ssock_on_data_sent` error path) have been deferred to a separate PR. The SSL send path has pre-existing issues that require a more thorough refactor; see the discussion in the review comments. The ioqueue-level fix in this PR is sufficient to address the original #4864 tdata leak.